BioCreative II Gene Mention Tagging System at IBM Watson
Abstract
This paper describes our system developed for the BioCreative II gene mention tagging task. The goal of this task is to annotate mentions of genes or gene products in the given Medline sentences. Our focus was to experiment with a semi-supervised learning method, Alternating Structure Optimization (ASO) [1], by which we exploited a large amount of unlabeled data in addition to the labeled training data provided by the organizer. The system is also equipped with automatic induction of high-order features, gene name lexicon lookup, classifier combination, and simple post-processing. Our system appears to be competitive: all three of our official runs fall into Quartile 1.

1 Gene mention tagging system

Our gene mention tagging system was built on top of a named entity chunking system described in [1], which was used for annotating names of persons, organizations, and so forth. This system casts the chunking task as sequential labeling, as is commonly done, by encoding chunk information into token tags. It uses a regularized linear classifier with the modified Huber loss and 2-norm regularization. That is, using the 'one-versus-all' scheme, we train one binary classifier for each token tag, using n labeled data points {(x_i, y_i)}, i = 1, ..., n, by

  ŵ = arg min_w Σ_{i=1}^{n} L(w^T x_i, y_i) + λ‖w‖².

The regularization parameter λ is set to 10. L is the modified Huber loss: L(p, y) = max(0, 1 − py)² if py ≥ −1, and −4py otherwise. The optimization is done by stochastic gradient descent. Viterbi-style dynamic programming is performed to find the token tag sequence with the largest confidence. Feature types are shown in Figure 1. Using this framework, we experimented with additional resources and algorithms, which we describe below.

· words, parts-of-speech, character types, and 4 characters at the beginning/ending, in a 5-word window
· words in a 3-syntactic-chunk window
· labels assigned to the two words on the left
· bi-grams of the current word and the label on the left
· labels assigned to previous occurrences of the current word

Figure 1: Feature types.
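As a rough illustration of the training objective above, the following is a minimal sketch, not the authors' implementation: the learning-rate schedule, number of epochs, and function names are assumptions, but the loss and regularizer are those given in the text.

```python
import numpy as np

# Sketch of one 'one-versus-all' binary classifier trained with the
# modified Huber loss and 2-norm regularization by stochastic gradient
# descent. Learning rate and epoch count are illustrative assumptions.

def modified_huber_grad(p, y):
    """Gradient of L(p, y) with respect to the prediction p = w^T x."""
    z = p * y
    if z >= 1.0:
        return 0.0                      # flat region, loss is 0
    if z >= -1.0:
        return -2.0 * (1.0 - z) * y     # d/dp of (1 - py)^2
    return -4.0 * y                     # d/dp of -4py

def train_binary_sgd(X, y, lam=10.0, epochs=20, eta=0.01):
    """w_hat = argmin_w sum_i L(w^T x_i, y_i) + lam * ||w||^2, by SGD."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            p = float(w @ X[i])
            grad = modified_huber_grad(p, y[i]) * X[i] + (2.0 * lam / n) * w
            w -= eta * grad
    return w
```

At decoding time, the per-tag confidence scores produced by these classifiers would feed the Viterbi-style search over token tag sequences.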
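Likewise, a hypothetical sketch of how the token-level feature types listed in Figure 1 might be instantiated; the feature-string formats and helper names are our own, and the syntactic-chunk window and previous-occurrence features are omitted for brevity.

```python
# Hypothetical sketch of the token-level feature types in Figure 1.

def char_type(word):
    """Coarse character-type signature, e.g. 'p53' -> 'aDD'."""
    return ''.join('A' if c.isupper() else 'a' if c.islower()
                   else 'D' if c.isdigit() else 'o' for c in word)

def token_features(words, pos_tags, left_labels, i):
    feats = []
    for off in range(-2, 3):                           # 5-word window
        j = i + off
        if 0 <= j < len(words):
            feats += [f"w[{off}]={words[j]}",
                      f"pos[{off}]={pos_tags[j]}",
                      f"ctype[{off}]={char_type(words[j])}"]
    feats.append(f"prefix4={words[i][:4]}")            # 4 chars at beginning
    feats.append(f"suffix4={words[i][-4:]}")           # 4 chars at ending
    for k, lab in enumerate(reversed(left_labels[-2:]), 1):
        feats.append(f"label[-{k}]={lab}")             # two labels on the left
    if left_labels:
        feats.append(f"w|label[-1]={words[i]}|{left_labels[-1]}")  # bi-gram
    return feats
```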
1.1 Exploiting unlabeled data through Alternating Structure Optimization (ASO)

ASO is a multi-task learning algorithm that seeks to improve performance on individual tasks by simultaneously learning multiple related tasks. Applying ASO to semi-supervised learning involves automatically generating thousands of prediction problems (called 'auxiliary problems') and their labeled data from unlabeled data, so that the multi-task learning algorithm can be applied to the unlabeled data. To put this into perspective, ASO-based semi-supervised learning can be viewed as learning a new (and better) feature representation from unlabeled data. This is done by learning auxiliary predictors that predict one part of the feature vectors from another part, which can be done from unlabeled data alone. Under certain conditions, it can be shown that learning auxiliary predictors of this type reveals the predictive structure (something useful for the target prediction problems) underlying the data. The final classifiers are trained with labeled data using the original features together with the new features learned from unlabeled data. Since modern classifiers based on empirical risk minimization can ignore irrelevant features to some degree, the risk of using unlabeled data in this way is relatively low, while its potential gain is large. [1] should be consulted for the details of ASO. Below, we only describe the specifics of our setting.
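As an illustration of the general ASO recipe from [1], and not the specifics of this system's setting, auxiliary problems can be created by masking one part of the feature vector and predicting it from the rest; the shared structure of the resulting predictors, obtained by SVD, then yields new features. The least-squares stand-in for the actual loss and the dimension h below are illustrative assumptions.

```python
import numpy as np

# Rough sketch of ASO-style feature learning in the spirit of [1];
# the auxiliary-problem construction, the ridge/least-squares solver,
# and h are assumptions, not this system's actual setting.

def train_aux_predictors(X_rest, aux_labels, lam=1e-3):
    """One linear predictor per auxiliary problem, from unlabeled data.

    X_rest: n x d matrix of the 'visible' part of the feature vectors.
    aux_labels: list of length-n arrays in {-1, +1}, one per auxiliary
        problem (e.g. 'is the masked current word a given frequent word?').
    """
    d = X_rest.shape[1]
    A = X_rest.T @ X_rest + lam * np.eye(d)
    W = [np.linalg.solve(A, X_rest.T @ y) for y in aux_labels]
    return np.stack(W, axis=1)               # d x (number of auxiliary problems)

def learn_structure(W, h=50):
    """Shared structure: top-h left singular vectors of the weight matrix."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T                         # Theta, h x d

def augment_features(X, theta):
    """Final classifiers use the original features plus Theta-projected ones."""
    return np.hstack([X, X @ theta.T])
```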